Lazy clean up dangling index metadata log entry #558

Merged


dai-chen

@dai-chen dai-chen commented Aug 12, 2024

Description

This PR introduces a lazy approach to handling potentially corrupted indices. Previously, only the recoverIndex API had this capability (introduced in #241); this PR makes the logic generic across all Flint APIs.

A corrupted index is defined as a Flint index that has an index log entry but no corresponding data index. On the next Flint API call, the system checks for such corrupted indices and performs the necessary cleanup, so that users can recreate indices without encountering errors.

Typical Scenario

The most common scenario this PR aims to address is:

  1. Users delete the OpenSearch (OS) data index for a Flint index directly
  2. The index monitor terminates the streaming job if auto refresh is enabled
  3. Users attempt to create the index again
    a. Before this change: creation fails and the log entry has to be removed manually
    b. After this change: the log entry is cleaned up automatically and creation succeeds

Uncommon Scenarios and Handling

After step 1 above, users may, in rare cases, attempt to:

  1. Create or vacuum the index again concurrently: the corruption check is skipped if the index is in CREATING or VACUUMING state. This reduces the possibility of a race condition and ensures the log entry is not removed by mistake during an ongoing operation.
  2. Operate before the index monitor terminates the streaming job: if the index is corrupted, the index monitor skips heartbeat reporting, which relies on the log entry, to reduce potential conflicts.
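
The check described in this section can be sketched as a small predicate. This is a minimal illustration in Python, not the code in this PR: the LogEntry class, the is_corrupted function and the lowercase state strings are hypothetical names chosen for the example. The three intermediate flags mirror the names that appear in the cleanup log output in the Example section (logEntryExists, dataIndexExists, isCreatingOrVacuuming).

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: states in which the check is skipped, per the PR description.
SKIP_STATES = {"creating", "vacuuming"}


@dataclass
class LogEntry:
    """Hypothetical stand-in for a Flint index metadata log entry."""
    state: str


def is_corrupted(log_entry: Optional[LogEntry], data_index_exists: bool) -> bool:
    """Return True if a log entry exists without a data index,
    unless the index is currently being created or vacuumed."""
    log_entry_exists = log_entry is not None
    is_creating_or_vacuuming = (
        log_entry is not None and log_entry.state.lower() in SKIP_STATES
    )
    return (
        log_entry_exists
        and not data_index_exists
        and not is_creating_or_vacuuming
    )
```

For instance, is_corrupted(LogEntry("failed"), False) is true and would trigger cleanup, while is_corrupted(LogEntry("creating"), False) is false so that an in-flight create is left alone.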

Note: The changes reduce the possibility of race conditions on a best-effort basis. However, because OpenSearch lacks transaction support, any check performed before acquiring an optimistic lock is still subject to "dirty reads". A small possibility of inconsistency therefore remains during concurrent operations.
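
To illustrate the dirty-read caveat, here is a toy in-memory model of optimistic locking, written in Python. It is a sketch under stated assumptions, not the Flint implementation: InMemoryLogStore, VersionConflict and delete_if_seq_no are hypothetical names, and the single sequence number loosely imitates the conditional writes that OpenSearch document APIs offer through if_seq_no and if_primary_term.

```python
class VersionConflict(Exception):
    """Raised when a conditional write sees a newer version than expected."""


class InMemoryLogStore:
    """Hypothetical stand-in for the metadata log index (not an OpenSearch client)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (seq_no, state)

    def put(self, doc_id, state):
        # Every write bumps the sequence number of the document.
        seq_no = self._docs.get(doc_id, (-1, None))[0] + 1
        self._docs[doc_id] = (seq_no, state)
        return seq_no

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def delete_if_seq_no(self, doc_id, expected_seq_no):
        # Conditional delete: only succeeds if nobody wrote since our read.
        current = self._docs.get(doc_id)
        if current is None or current[0] != expected_seq_no:
            raise VersionConflict(doc_id)
        del self._docs[doc_id]


store = InMemoryLogStore()
store.put("idx", "failed")
seq_no, state = store.get("idx")   # cleanup path reads: looks corrupted
store.put("idx", "creating")       # concurrent create writes in between
try:
    store.delete_if_seq_no("idx", seq_no)
    deleted = True
except VersionConflict:
    deleted = False                # stale read detected, entry is kept
```

In this model the conditional delete rejects the stale read. A plain check-then-delete without the sequence number would have removed the entry of the concurrent create, which is the kind of window the note refers to.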

Example

Prepare corrupted Flint index:

# Create an auto refresh Flint index
CREATE SKIPPING INDEX ON glue.default.http_logs (
  year PARTITION
)
WITH (
  auto_refresh = true,
  checkpoint_location = 's3://checkpoint-1'
);

# Delete OS index once all data refreshed
DELETE flint_glue_default_http_logs_skipping_index

# Index monitor detects and terminates streaming job
24/09/30 18:27:43 WARN FlintSparkIndexMonitor: Streaming job is active but data is deleted
24/09/30 18:27:43 INFO FlintSparkIndexMonitor: Terminating streaming job and index monitor for
 flint_glue_default_http_logs_skipping_index

# Flint index log entry is dangling in failed state
GET .query_execution_request_glue/_search
          "version": "1.0",
          "latestId": "ZmxpbnRfZ2x1ZV9kZWZhdWx0X2h0dHBfbG9nc19za2lwcGluZ19pbmRleA==",
          "type": "flintindexstate",
          "state": "failed",
          "applicationId": "XXX",
          "jobId": "YYY",
          "dataSourceName": "glue",
          "jobStartTime": 1727720050924,
          "lastUpdateTime": 1727720863284,
          "error": ""
        }
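
As a side note on reading the entry above: decoding the latestId value as base64 yields the Flint index name used throughout this example. The short Python sketch below only decodes the string copied from the output; that the ID is generally derived this way is an assumption based on this one sample.

```python
import base64

# Value copied verbatim from the "latestId" field in the search output.
latest_id = "ZmxpbnRfZ2x1ZV9kZWZhdWx0X2h0dHBfbG9nc19za2lwcGluZ19pbmRleA=="

index_name = base64.b64decode(latest_id).decode("utf-8")
print(index_name)  # flint_glue_default_http_logs_skipping_index
```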

Verify cleanup logic works:

# Attempt to create the same index again
CREATE SKIPPING INDEX ON glue.default.http_logs (
  year PARTITION
)
WITH (
  auto_refresh = true,
  checkpoint_location = 's3://checkpoint-2'
);

# Index metadata log entry is cleaned up
24/09/30 18:34:35 INFO FlintSpark: Starting index operation 
[Create Flint index flint_glue_default_http_logs_skipping_index] with forceInit=true
24/09/30 18:34:35 WARN FlintSpark: 
 Cleaning up corrupted index:
 - logEntryExists [true]
 - dataIndexExists [false]
 - isCreatingOrVacuuming [false]
24/09/30 18:34:39 INFO FlintSpark: Index operation
[Create Flint index flint_glue_default_http_logs_skipping_index] complete

GET .query_execution_request_glue/_search
          "version": "1.0",
          "latestId": "ZmxpbnRfZ2x1ZV9kZWZhdWx0X2h0dHBfbG9nc19za2lwcGluZ19pbmRleA==",
          "type": "flintindexstate",
          "state": "refreshing",
          "applicationId": "XXX",
          "jobId": "YYY",
          "dataSourceName": "glue",
          "jobStartTime": 1727721279398,
          "lastUpdateTime": 1727723128616,
          "error": ""
        }

Issues Resolved

#356

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added the enhancement and 0.6 labels Aug 12, 2024
@dai-chen dai-chen self-assigned this Aug 12, 2024
@dai-chen dai-chen force-pushed the clean-up-dangling-metadata-log-entry branch from 1753521 to 1aa3a31 Compare September 3, 2024 17:22
@dai-chen dai-chen force-pushed the clean-up-dangling-metadata-log-entry branch from 1aa3a31 to 808fe54 Compare September 16, 2024 17:54
@dai-chen dai-chen marked this pull request as ready for review September 18, 2024 16:28
@dai-chen dai-chen force-pushed the clean-up-dangling-metadata-log-entry branch from 09c2e5c to a422c3f Compare September 30, 2024 16:34

@noCharger noCharger left a comment


Will the PR also handle the case where index creation fails with an OpenSearch timeout and the index status remains 'creating'?

Signed-off-by: Chen Dai <[email protected]>
@dai-chen dai-chen force-pushed the clean-up-dangling-metadata-log-entry branch from 55d43d2 to c2ab09f Compare October 15, 2024 18:59
@dai-chen dai-chen requested a review from LantaoJin as a code owner October 15, 2024 18:59
@dai-chen dai-chen merged commit 8461ff9 into opensearch-project:main Oct 15, 2024
4 checks passed
@dai-chen dai-chen deleted the clean-up-dangling-metadata-log-entry branch October 15, 2024 22:01